
    SQ Lower Bounds for Learning Bounded Covariance GMMs

    We study the complexity of learning mixtures of separated Gaussians with common unknown bounded covariance matrix. Specifically, we focus on learning Gaussian mixture models (GMMs) on $\mathbb{R}^d$ of the form $P = \sum_{i=1}^k w_i \mathcal{N}(\boldsymbol{\mu}_i, \mathbf{\Sigma}_i)$, where $\mathbf{\Sigma}_i = \mathbf{\Sigma} \preceq \mathbf{I}$ and $\min_{i \neq j} \|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|_2 \geq k^\epsilon$ for some $\epsilon > 0$. Known learning algorithms for this family of GMMs have complexity $(dk)^{O(1/\epsilon)}$. In this work, we prove that any Statistical Query (SQ) algorithm for this problem requires complexity at least $d^{\Omega(1/\epsilon)}$. In the special case where the separation is on the order of $k^{1/2}$, we additionally obtain fine-grained SQ lower bounds with the correct exponent. Our SQ lower bounds imply similar lower bounds for low-degree polynomial tests. Conceptually, our results provide evidence that known algorithms for this problem are nearly best possible.
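
    To make the model concrete, below is a minimal Python sketch of sampling from a GMM of this form; the dimension, weights, covariance, and the way the separated means are placed are all hypothetical illustration choices, not taken from the paper.

    import numpy as np

    # Hypothetical instance of the abstract's model: P = sum_i w_i N(mu_i, Sigma)
    # with a common covariance Sigma <= I and pairwise mean separation >= k^eps.
    rng = np.random.default_rng(0)
    d, k, eps = 10, 4, 0.5              # dimension, components, separation exponent

    # Place the means along e_1 at spacing k^eps, so that
    # min_{i != j} ||mu_i - mu_j||_2 >= k^eps holds by construction.
    sep = k ** eps
    means = np.zeros((k, d))
    means[:, 0] = sep * np.arange(1, k + 1)

    Sigma = np.diag(rng.uniform(0.1, 1.0, size=d))  # common covariance, Sigma <= I
    weights = np.full(k, 1.0 / k)                   # uniform mixing weights

    def sample_gmm(n):
        """Draw n i.i.d. samples from P = sum_i w_i N(mu_i, Sigma)."""
        comps = rng.choice(k, size=n, p=weights)
        noise = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
        return means[comps] + noise, comps

    X, labels = sample_gmm(1000)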

    Estimating the Number of Induced Subgraphs from Incomplete Data and Neighborhood Queries

    We consider a natural setting where network parameters are estimated from noisy and incomplete information about the network. More specifically, we investigate how we can efficiently estimate the number of small subgraphs (e.g., edges, triangles, etc.) based on full access to one or two noisy and incomplete samples of a large underlying network and on a few queries revealing the neighborhood of carefully selected vertices. After specifying a random generator which removes edges from the underlying graph, we present estimators with strong provable performance guarantees, which exploit information from the noisy network samples and query a constant number of the most important vertices for the estimation. Our experimental evaluation shows that, in practice, a single noisy network sample and a few hundred neighborhood queries suffice for accurately estimating the number of triangles in networks with millions of vertices and edges.
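
    As a toy version of the scaling idea (not the paper's estimator, which additionally exploits neighborhood queries): if each edge of the underlying graph survives in the noisy sample independently with a known probability p, then every triangle survives with probability p^3, so dividing the observed triangle count by p^3 gives an unbiased estimate. The graph model and probabilities below are hypothetical.

    import itertools, random

    random.seed(0)
    n, p_edge, p_keep = 200, 0.1, 0.7   # graph size, G(n,p) density, edge survival prob.

    # Underlying Erdos-Renyi graph as a set of edges (i, j) with i < j.
    edges = {(i, j) for i, j in itertools.combinations(range(n), 2)
             if random.random() < p_edge}

    # Noisy, incomplete sample: each edge kept independently with probability p_keep.
    sample = {e for e in edges if random.random() < p_keep}

    def count_triangles(edge_set):
        adj = {v: set() for v in range(n)}
        for i, j in edge_set:
            adj[i].add(j); adj[j].add(i)
        # Count each triangle {a < b < c} exactly once, via its edge (a, b).
        return sum(1 for i, j in edge_set
                   for l in adj[i] & adj[j] if l > j)

    true_T = count_triangles(edges)
    est_T = count_triangles(sample) / p_keep ** 3  # unbiased: E[observed] = p^3 * T
    print(true_T, round(est_T))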

    Streaming Algorithms for High-Dimensional Robust Statistics

    We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for high-dimensional robust statistics with near-optimal memory requirements (up to logarithmic factors). Our main result is for the task of high-dimensional robust mean estimation in (a strengthening of) Huber's contamination model. We give an efficient single-pass streaming algorithm for this task with near-optimal error guarantees and space complexity nearly-linear in the dimension. As a corollary, we obtain streaming algorithms with near-optimal space complexity for several more complex tasks, including robust covariance estimation, robust regression, and more generally robust stochastic optimization.
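
    For intuition, here is a small Python sketch of (plain) Huber contamination in a stream; the paper works with a strengthening of this model, and the code below is not its algorithm. It only illustrates that the naive single-pass running mean already fits the streaming memory budget, roughly O(d) words, but is not robust: its error grows with the magnitude of the outliers. All parameters and the outlier choice are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n, eps = 100, 20000, 0.1
    mu = np.ones(d)

    running_sum, count = np.zeros(d), 0    # O(d) memory, single pass over the stream
    for _ in range(n):
        if rng.random() < eps:
            x = 50.0 * np.ones(d)          # adversarial point, far from mu
        else:
            x = mu + rng.standard_normal(d)  # inlier from N(mu, I)
        running_sum += x
        count += 1

    naive_mean = running_sum / count
    # Error is roughly eps * ||outlier - mu||, i.e., far from the optimal O(eps) rate.
    print(np.linalg.norm(naive_mean - mu))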

    Robust Sparse Mean Estimation via Sum of Squares

    We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb{R}^d$ with "certifiably bounded" $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k \log(d))^{O(t)}/\epsilon^{2-2/t}$, where $k$ is the sparsity. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde{O}(\epsilon)$ with sample complexity $m = O(k^4 \mathrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs-to-algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively the best possible.
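
    For illustration only, here is a hypothetical naive baseline in Python (not the paper's Sum-of-Squares algorithm): take coordinate-wise medians, which tolerate an eps-fraction of outliers in each coordinate, then keep the k largest-magnitude coordinates to exploit sparsity. It conveys the problem setup but does not attain the guarantees above; all parameters and the corruption pattern are made up.

    import numpy as np

    rng = np.random.default_rng(2)
    d, k, n, eps = 1000, 10, 5000, 0.05
    mu = np.zeros(d)
    mu[:k] = 3.0                            # k-sparse true mean

    X = mu + rng.standard_normal((n, d))    # inlier samples from N(mu, I)
    n_bad = int(eps * n)
    X[:n_bad] = 100.0                       # adversary overwrites an eps-fraction

    est = np.median(X, axis=0)              # robust per coordinate
    support = np.argsort(np.abs(est))[-k:]  # keep the k largest-magnitude coordinates
    mu_hat = np.zeros(d)
    mu_hat[support] = est[support]
    print(np.linalg.norm(mu_hat - mu))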

    List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering

    We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $\alpha \in (0, 1/2)$, we are given $m$ points in $\mathbb{R}^n$, $\lfloor \alpha m \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $\mu$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\widehat{\mu}$ such that $\|\widehat{\mu} - \mu\|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with "certifiably bounded" $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k \log(n))^{O(t)}/\alpha$ and running time $\mathrm{poly}(m n^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta(\sqrt{\log(1/\alpha)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds.
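
    To illustrate the list-decodable setting, the sketch below is a hypothetical baseline, not the paper's difference-of-pairs filter: since only an $\alpha$-fraction of points are inliers, a single estimate cannot succeed, so the output is a short list of candidate means. It proposes random data points as candidates, keeps those with roughly an $\alpha$-fraction of the data nearby, and greedily deduplicates; the radius and threshold are arbitrary toy choices.

    import numpy as np

    rng = np.random.default_rng(3)
    n_dim, m, alpha = 20, 2000, 0.2
    mu = np.zeros(n_dim); mu[:5] = 4.0     # sparse inlier mean

    n_in = int(alpha * m)
    inliers = mu + rng.standard_normal((n_in, n_dim))
    outliers = 10.0 * rng.standard_normal((m - n_in, n_dim))  # arbitrary majority
    X = np.vstack([inliers, outliers])

    radius = 2.0 * np.sqrt(n_dim)          # radius capturing most inlier mass
    cands = []
    for x in X[rng.choice(m, size=50, replace=False)]:
        near = X[np.linalg.norm(X - x, axis=1) <= radius]
        if len(near) >= 0.5 * alpha * m:   # enough mass nearby to be plausible
            c = near.mean(axis=0)
            if all(np.linalg.norm(c - c2) > radius for c2 in cands):
                cands.append(c)            # deduplicate nearby candidates

    # A short list; at least one candidate should be close to the true mean.
    print(len(cands), min(np.linalg.norm(c - mu) for c in cands))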